<!DOCTYPE html>
<html>
<head>

<meta content="text/html; charset=UTF-8" http-equiv="Content-Type">

<link rel="stylesheet" href="http://netdna.bootstrapcdn.com/bootstrap/3.0.3/css/bootstrap.min.css">
<link rel="stylesheet" href="http://netdna.bootstrapcdn.com/bootstrap/3.0.3/css/bootstrap-responsive.min.css">

<style type="text/css">
.nav { }
.nav li { float: left; width: 110px; }
.container { background-color: #ffffff; padding: 30px; }
.content { padding: 30px; }
body, h1, h2, h3, h4 { font-family: "Trebuchet MS","Helvetica Neue",Arial,Helvetica,sans-serif,"Georgia";}
body { background-color: #eeeeee; }
header { width: 100%; }
blockquote p { font-size: 14px; }
</style>
<title>ZPar | Chinese Word Segmentation</title>
</head>

<body>
<div class="table-bordered container">
<div class="content">

<header>
<div class="page-header text-center">
<h1>Chinese Word Segmentation</h1>
</div>
</header>

<h2><a id="How-to-compile">How to compile</a></h2>
<p>
Suppose that ZPar has been downloaded to the directory <code>zpar</code>. To make the segmentor system, type <code>make segmentor</code>. This will create a directory <code>zpar/dist/segmentor</code>, in which there are two files: <code>train</code> and <code>segmentor</code>. The file <code>train</code> is used to train a segmentation model,and the file <code>segmentor</code> is used to segment new texts using a trained segmentation model.
</p>

<h2><a id="Format-of-inputs-and-outputs">Format of inputs and outputs</a></h2>
<p>
The input files to the segmentor are formatted as a sequence of Chinese characters. An example input is:
</p>
<pre><code> ZPar可以分析中文和英文</code></pre>

<p>
The output files contain space-separated words:
</p>
<pre><code> ZPar 可以 分析 中文 和 英文</code></pre>

<p>
The output format is also the format of training files for the <code>train</code> executable.
</p>

<p>
Both input and output files must be enemd in <em>utf8</em>. Here is a <a href="seg_files/gb2utf.py">script</a> that transfers files that are enemd in <em>gb</em> to the <em>utf8</em> encoding.
</p>

<h2><a id="How-to-train-a-model">How to train a model</a></h2>

<p>
To train a model, use
</p>
<pre><code> zpar/dist/segmentor/train &lt;train-file&gt; &lt;model-file&gt; &lt;number of iterations&gt;</code></pre>

<p>
For example, using the <a href="seg_files/train.txt">example train file</a>, you can train a  model by
</p>
<pre><code> zpar/dist/segmentor/train train.txt model 1</code></pre>

<p>
After training is completed, a new file <code>model</code> will be created in the current directory, which can be used to segment raw Chinese sentences. The above command performs training with one iteration
(see <a href="#tuning">How to tune the performance of a system</a>
) using the training file.
</p>

<h2><a id="How-to-segment-new-texts">How to segment new texts</a></h2>

<p>
To apply an existing model to segment new texts, use
</p>
<pre><code> zpar/dist/segmentor/segmentor &lt;model&gt; [&lt;input-file&gt;] [&lt;output-file&gt;]</code></pre>

<p>
where the input file and output file are optional. If the output file is not specified, segmented texts will be printed to the console. If the input file is not specified, raw texts will be read from the console. For example, using the model we just trained, we can segment <a href="seg_files/input.txt">an example input</a> by
</p>
<pre><code> zpar/dist/segmentor/segmentor model input.txt output.txt</code></pre>

<p>
The output file contains automatically segmented texts.
</p>

<h2><a id="Outputs-and-evaluation">Outputs and evaluation</a></h2>

<p>
Automatically segmented texts contain errors. In order to evaluate the quality of the outputs, we can manually specify the segmentation of a sample, and compare the outputs with the correct sample.
</p>

<p>
A manually specified segmentation of the input file is given in <a href="seg_files/reference.txt">this example reference file</a>. Here is a <a href="seg_files/evaluate.py">Python script</a> that performs automatic evaluation.
</p>

<p>
Using the above <code>output.txt</code> and <code>reference.txt</code>, we can evaluate the accuracies by typing
</p>
<pre><code> python evaluate.py output.txt reference.txt</code></pre>

<p>
You can find the precision, recall, and f-score here.
</p>

<h2><a id="tuning">How to tune the performance of a system</a></h2>

<p>
The performance of the system after one training iteration may not be optimal. You can try training a model for another few iterations, after each you compare the performance. You can choose the model that gives the highest f-score on your test data. We conventionally call this test file the development test data, because you develop a segmentation model using this. Here is <a href="seg_files/test.sh">a shell script</a> that automatically trains the segmentor for 30 iterations, and after the ith iteration, stores the model file to model.i. You can compare the f-score of all 30 iterations and choose model.k, which gives the best f-score, as the final model. In this file, there is a variable called <code>segmentor</code>. You need to set this variable to the relative directory of <code>zpar/dist/segmentor</code>.
</p>

<h2><a id="Source-code">Source code</a></h2>
<p>
The source code for the segmentor can be found at
</p>
<pre><code> zpar/src/chinese/segmentor/implementation/SEGMENTOR_IMPL</code></pre>

<p>
where SEGMENTOR_IMPL is a macro defined in <code>Makefile</code>, and specifies a specific implementation for the segmentor.
</p>

<h2><a id="reference">Reference</a></h2>
<ul>
<li>
Yue Zhang and Stephen Clark. 2007. Chinese Segmentation Using a Word-based Perceptron Algorithm. In <em>Proc. of ACL</em>. pages 840-847.
</li>
</ul>

</div>
</div>
<footer class="text-center">
<p>
ZPar Release 0.7
</p>
</footer>
</body>
</html>
